Online-Academy

Look, Read, Understand, Apply


Data Mining And Data Warehousing

Q and A

Define data mining.
Data mining is the technique of extracting hidden, previously unknown knowledge from huge volumes of data.
What is data pre-processing in data mining?
Data pre-processing is the process of preparing data for a data mining task. It removes redundancies from the data, makes the data consistent and error-free, transforms the data into the format required by the data mining algorithms, and reduces the size of the data.
What are the methods of data normalization?
  • Min-Max normalization
  • Z-score normalization
  • Decimal scaling
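The three normalization methods above can be sketched in Python (a minimal illustration; function names are my own, and z-score here uses the population standard deviation):

```python
# Illustrative implementations of the three normalization methods.
def min_max(values, new_min=0.0, new_max=1.0):
    # Rescale linearly so min -> new_min and max -> new_max.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    # Center on the mean and divide by the (population) standard deviation.
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

def decimal_scaling(values):
    # Divide by 10^j, where j is the smallest power making every |v| < 1.
    # Assumes the maximum absolute value is at least 1.
    j = len(str(int(max(abs(v) for v in values))))
    return [v / 10 ** j for v in values]

data = [200, 300, 400, 600, 1000]
print(min_max(data))          # [0.0, 0.125, 0.25, 0.5, 1.0]
print(decimal_scaling(data))  # [0.02, 0.03, 0.04, 0.06, 0.1]
```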
What are the methods for data reduction?
  • Clustering
  • Correlation analysis
What is smoothing?
Smoothing is a technique to make data consistent by removing noise. Binning by means and binning by boundaries are smoothing techniques.
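Both binning techniques can be sketched in Python: sort the values, partition them into equal-size bins, then replace each value by its bin mean or by the nearest bin boundary (the price data and function names are illustrative):

```python
# Smoothing by bin means: each value becomes the mean of its bin.
def bin_by_means(values, bin_size):
    s = sorted(values)
    out = []
    for i in range(0, len(s), bin_size):
        bin_ = s[i:i + bin_size]
        mean = sum(bin_) / len(bin_)
        out.extend([mean] * len(bin_))
    return out

# Smoothing by bin boundaries: each value snaps to the closer bin edge.
def bin_by_boundaries(values, bin_size):
    s = sorted(values)
    out = []
    for i in range(0, len(s), bin_size):
        bin_ = s[i:i + bin_size]
        lo, hi = bin_[0], bin_[-1]
        out.extend([lo if v - lo <= hi - v else hi for v in bin_])
    return out

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(bin_by_means(prices, 3))       # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
print(bin_by_boundaries(prices, 3))  # [4, 4, 15, 21, 21, 24, 25, 25, 34]
```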
What is a data cube?
A data cube is a representation of summarized data along several dimensions. It allows data to be viewed and analyzed in multiple dimensions.
What are dimensions and facts in star schema?
Dimensions are objects, entities, or perspectives about which an organization stores data; time, item, and location are examples of dimensions. Facts are numerical measures by which organizations analyze relationships between dimensions; units_sold and total_sold are examples of facts.
What is the advantage of having a data warehouse?
A data warehouse:
  • provides competitive advantage, as it presents relevant information;
  • enhances the productivity of the organization, as it collects data quickly and efficiently;
  • provides a consistent view of customers and helps to improve customer relationships;
  • helps to reduce costs by tracking trends, patterns, and exceptions consistently over long periods.
What is the three-tier architecture of a data warehouse?
  • The bottom tier is the data warehouse server, a relational database.
  • The middle tier is an OLAP server, either ROLAP (an extended relational database that maps multidimensional data and operations to standard relational operations) or MOLAP (a special-purpose server that directly implements multidimensional data and operations).
  • The top tier is a user interface, which contains query and reporting tools, analysis tools, and data mining tools.
What are the major components of a data mining system?
  • Database, data warehouse, World Wide Web
  • Database or data warehouse server
  • Knowledge base
  • Data Mining engine
  • Pattern evaluation module
  • User Interface
Describe the Apriori algorithm.
The Apriori algorithm is an iterative approach:
  • First, the set of frequent 1-itemsets is found by scanning the database; items not satisfying minimum support are discarded, and the resulting set is denoted L1.
  • L1 is joined with itself to generate candidate 2-itemsets; candidates not satisfying minimum support are discarded, giving L2, the set of frequent 2-itemsets.
  • L2 is used to find L3, the set of frequent 3-itemsets, and so on, until no further frequent itemsets can be found.
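The iterations above can be sketched in Python (a simplified, illustrative implementation: support is counted as an absolute number of transactions, and the candidate-pruning step of the full algorithm is omitted):

```python
# Minimal Apriori sketch: find L1 by a database scan, then repeatedly join
# the previous level with itself and keep only candidates meeting min_support.
def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]

    def supported(candidates):
        # Keep candidates contained in at least min_support transactions.
        return {c for c in candidates
                if sum(c <= t for t in transactions) >= min_support}

    items = {frozenset([i]) for t in transactions for i in t}
    level = supported(items)            # L1
    frequent = set(level)
    k = 2
    while level:
        # Join L(k-1) with itself to form candidate k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        level = supported(candidates)   # Lk
        frequent |= level
        k += 1
    return frequent

db = [{'A', 'B', 'C'}, {'A', 'B'}, {'A', 'C'}, {'B', 'C'}]
freq = apriori(db, min_support=2)
print(sorted(map(sorted, freq)))   # [['A'], ['A', 'B'], ['A', 'C'], ['B'], ['B', 'C'], ['C']]
```

Here {A, B, C} is not frequent because it appears in only one transaction, while every 1- and 2-itemset appears in at least two.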
How can the Apriori algorithm be improved?
By transaction reduction: a transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets, so it can be ignored in subsequent scans.
Explain Frequent Pattern (FP) -growth algorithm.
FP-growth adopts a divide-and-conquer strategy. First, it builds a frequent-pattern tree (FP-tree) from the database, representing the frequent items; this tree retains the itemset association information. Second, it derives conditional databases from the FP-tree, each associated with one frequent item or pattern fragment, and mines each conditional database separately.
What kind of data can be mined?
Data mining can be performed on a number of different data repositories:
  • Relational Databases
  • Data warehouses
  • Transactional Databases
  • Advanced database systems
  • Flat files
  • Data streams
  • World Wide Web (WWW)
What is a transactional database?
A transactional database consists of a file in which each record represents a transaction. A transaction typically includes a transaction identifier and a list of items, such as the products purchased in a store.
What kinds of Patterns can be mined?
Data mining tasks can be classified into two categories: descriptive and predictive.
Descriptive mining characterizes the general properties of the data. Predictive mining performs inference on the current data to make predictions.
Data mining functionalities are:
  • Concept/class description
  • Mining Frequent Patterns, Associations
  • Classification and Prediction
  • Cluster Analysis
  • Outlier Analysis
  • Evolution Analysis
Are all the patterns generated by data mining tasks interesting?
Not necessarily. Answers to the following questions decide whether a pattern is interesting:
  • Is the pattern easy for human users to understand?
  • Is the pattern valid and useful for human users?
How is the interestingness of a pattern measured?
Two measures validate the interestingness of patterns:
  • Support for X -> Y: the percentage of transactions in the database that satisfy the given rule, i.e., that contain both X and Y.
  • Confidence for X -> Y: the degree of certainty of the association, i.e., the percentage of transactions containing X that also contain Y.
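Both measures can be computed with a short Python sketch (the transaction data and function names are made up for illustration):

```python
# Support: fraction of transactions containing every item in the itemset.
def support(transactions, itemset):
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

# Confidence of X -> Y: support(X union Y) / support(X).
def confidence(transactions, x, y):
    return support(transactions, set(x) | set(y)) / support(transactions, x)

db = [{'bread', 'milk'}, {'bread', 'butter'},
      {'bread', 'milk', 'butter'}, {'milk'}]
print(support(db, {'bread', 'milk'}))        # 0.5 (2 of 4 transactions)
print(confidence(db, {'bread'}, {'milk'}))   # 2/3 of bread buyers also buy milk
```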
What is data cleaning?
Data cleaning is the process of filling in missing values, smoothing out noise, and correcting inconsistencies in the data.
How can missing values be filled?
  • Ignore tuple
  • Fill in the missing value manually
  • Use a global constant to fill in the missing values
  • Use the attribute mean to fill in the missing values
  • Use the most probable value
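The attribute-mean strategy, for example, can be sketched in Python (the column data and function name are hypothetical; None stands in for a missing value):

```python
# Replace each missing value (None) with the mean of the known values.
def fill_with_mean(column):
    known = [v for v in column if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in column]

income = [30000, None, 50000, None, 40000]
print(fill_with_mean(income))   # [30000, 40000.0, 50000, 40000.0, 40000]
```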
What is noise?
Noise is a random error or variance in a variable.
How to smooth out noise from numeric data?
Binning methods, regression analysis, and clustering can be used to smooth out noise from numeric data.
How can redundancies in data be detected?
Correlation analysis can detect some redundancies. If growth or decline in one attribute is consistently accompanied by growth or decline in another attribute, one of the two attributes can be considered redundant.
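This kind of check is commonly done with the Pearson correlation coefficient: values near +1 or -1 suggest one attribute is largely derivable from the other. A small Python sketch (the height data below is invented, the second column being roughly the first converted to inches):

```python
# Pearson correlation coefficient between two numeric attributes.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

height_cm = [150, 160, 170, 180]
height_in = [59.1, 63.0, 66.9, 70.9]    # nearly the same attribute in inches
print(pearson(height_cm, height_in))    # close to 1.0 -> likely redundant
```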
What is data transformation?
Data transformation is the process of transforming data into forms appropriate for data mining. Some of the data transformation techniques are:
  • Smoothing
  • Aggregation
  • Generalization
  • Normalization
  • Attribute construction
What is data reduction?
Data reduction is the process of generating a reduced representation of a data set without losing the integrity or meaning of the original data. Data can be reduced using the following techniques:
  • Data cube aggregation
  • Attribute subset selection
  • Dimensionality reduction
  • Numerosity reduction
  • Discretization and concept hierarchy generation
What is classification?
Classification is the process of grouping data elements or objects based on their similarity. Objects (elements) have attributes, and the similarity between objects is calculated using those attributes. Decision trees, neural networks, and Bayesian classifiers are classification algorithms.
What is association analysis?
Association analysis is the process of finding relations or associations between objects (elements), where the presence of one object influences the presence of another. "If A is purchased then X will also be purchased" is an example of an association. Market basket analysis is an example of association analysis. The Apriori algorithm and the frequent pattern tree (FP-tree) are association analysis algorithms.
What is clustering?
Clustering is the grouping of objects based on the closeness between the objects. Clustering differs from classification in that clustering is an unsupervised process, whereas classification is a supervised process. K-means, K-medoids, DBSCAN, agglomerative, and divisive are clustering algorithms.
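As an illustration, k-means on one-dimensional data can be sketched in a few lines of Python (the points and initial centroids are made up; real implementations also check for convergence):

```python
# Minimal 1-D k-means sketch: assign each point to its nearest centroid,
# then recompute each centroid as the mean of its assigned points.
def kmeans_1d(points, centroids, iters=10):
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[idx].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

pts = [1, 2, 3, 10, 11, 12]
centers, groups = kmeans_1d(pts, centroids=[1, 12])
print(centers)   # [2.0, 11.0]
print(groups)    # [[1, 2, 3], [10, 11, 12]]
```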
What are the processes involved in data warehousing?
The processes involved in data warehousing are:
  • Extract and load
  • Clean and Transform
  • Backup and Archive
  • Query processing
What are the components of a data warehouse for handling data warehousing processes?
  • Extract and load: Load Manager
  • Clean and transform: Data Warehouse Manager
  • Backup and archive: Data Warehouse Manager
  • Query processing: Query Manager
How is database size calculated for setting up a Data Warehouse?
The following entities are included when calculating the size required for setting up a data warehouse:
  • Size of Fact Data
  • Size of Aggregate Data
  • Size of Dimension Data
  • Size of Index data of Fact Data
  • Size of Index data of Aggregate Data
  • Size of Index data of Dimension Data
  • Temporary Space
  • Sort Space
Why are aggregate tables created in a data warehouse?
Aggregate tables are created to speed up query response time. Users ask many queries, and creating summary tables on the fly would be time-consuming, forcing users to wait long for their responses. If an aggregate table matches the user's query, the user gets an immediate response.
What is a Bayesian classifier?
A Bayesian classifier is a statistical classifier that can predict the probability that a tuple belongs to a particular class. It is based on Bayes' theorem. The naive Bayesian classifier is called naive because it assumes that the effect of an attribute on a class is independent of the values of the other attributes.
How does a Bayesian classifier work?
  • Let D be a training data set of tuples with associated class labels C1, C2, ..., Cm.
  • Let a tuple X be represented by an n-dimensional attribute vector.
  • X is assigned to the class having the highest posterior probability conditioned on X.
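These steps can be sketched for categorical attributes in Python (the weather-style training data and names are invented; no Laplace smoothing is applied, so unseen attribute values get probability zero):

```python
from collections import Counter, defaultdict

# Estimate the prior P(C) and the per-attribute conditionals P(x_j | C).
def train(rows, labels):
    n = len(labels)
    prior = {c: cnt / n for c, cnt in Counter(labels).items()}
    cond = defaultdict(Counter)      # (class, attr index) -> value counts
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            cond[(c, j)][v] += 1
    return prior, cond

# Pick the class maximizing P(C) * product over j of P(x_j | C),
# using the naive independence assumption between attributes.
def predict(prior, cond, x):
    def score(c):
        p = prior[c]
        for j, v in enumerate(x):
            counts = cond[(c, j)]
            p *= counts[v] / sum(counts.values())
        return p
    return max(prior, key=score)

rows = [('sunny', 'hot'), ('sunny', 'mild'), ('rain', 'mild'), ('rain', 'cool')]
labels = ['no', 'no', 'yes', 'yes']
prior, cond = train(rows, labels)
print(predict(prior, cond, ('rain', 'mild')))   # 'yes'
```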
What is backpropagation?
Backpropagation is a neural network learning algorithm. A neural network is a set of connected input/output units in which each connection has a weight associated with it. The network learns by adjusting these weights, which is why this approach is called connectionist learning.
High tolerance of noisy data and the ability to classify patterns on which it has not been trained are advantages of a neural network.
How does backpropagation work?
The error between the target value and the predicted value is calculated at the output; this error is then propagated backwards from the output layer to the hidden layer, and from the hidden layer to the input layer, and the weights of the connections are adjusted along the way. A neural network that adjusts its weights in this backward fashion is called a backpropagation neural network.
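A single backpropagation step can be sketched on a deliberately tiny network with one input, one sigmoid hidden unit, and one sigmoid output (the learning rate, initial weights, and training pair are arbitrary; real networks have many units per layer, but the update rule is the same):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# One backpropagation step: forward pass, error at the output,
# error propagated back to the hidden layer, then weight updates.
def backprop_step(x, target, w_hidden, w_out, lr=0.5):
    h = sigmoid(w_hidden * x)                        # forward pass
    y = sigmoid(w_out * h)
    delta_out = (y - target) * y * (1 - y)           # output-layer error term
    delta_hidden = delta_out * w_out * h * (1 - h)   # propagated backwards
    return w_hidden - lr * delta_hidden * x, w_out - lr * delta_out * h

w_h, w_o = 0.5, 0.5
for _ in range(1000):
    w_h, w_o = backprop_step(1.0, 1.0, w_h, w_o)
print(sigmoid(w_o * sigmoid(w_h * 1.0)))   # approaches the target 1.0
```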
What is a decision tree?
A decision tree is a tree structure with internal nodes and leaves. Each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label. The topmost node in the tree is the root node.
How is decision tree used for classification?
The attributes of a tuple X with an unknown class label are tested against the decision tree; a path is traced from the root to a leaf; the leaf node provides the class label for the tuple X.
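The traversal described above can be sketched in Python, with the tree encoded as nested dicts (the tree itself is hypothetical, loosely modeled on a buys-computer example):

```python
# Walk a decision tree to classify a tuple. Representation:
#   internal node = {attribute: {outcome: subtree}},  leaf = class label.
def classify(tree, tuple_):
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))  # test at this node
        tree = branches[tuple_[attribute]]              # follow the branch
    return tree                                         # leaf: class label

tree = {'age': {'youth': {'student': {'yes': 'buys', 'no': 'no_buy'}},
                'middle': 'buys',
                'senior': 'no_buy'}}
print(classify(tree, {'age': 'youth', 'student': 'yes'}))   # 'buys'
```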